Text Clustering Using Cosine Similarity and Matrix Factorization
Abstract
Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. Text clustering divides a collection of text documents into categories so that documents in the same category describe the same topic, such as classical music. It efficiently groups documents with similar content into the same cluster, with similarity between objects measured by a similarity function. Hierarchical clustering schemes can be used effectively for processing large datasets. This paper proposes using the hierarchical clustering technique called the “sub leader algorithm” together with cosine similarity to cluster the documents. Key words: Similarity measures, text clustering, data
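The abstract does not spell out the sub leader algorithm, so the following is only a minimal sketch of the two ingredients it names: cosine similarity between documents and a single-pass, leader-style grouping. The function names (`term_vector`, `cosine_similarity`, `leader_clusters`), the whitespace tokenizer, the raw term-frequency weighting, and the threshold of 0.5 are all assumptions made for illustration, not details taken from the paper. A sub-leader variant would presumably repeat the same pass inside each cluster with a stricter threshold.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (assumed: lowercase, whitespace split)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def leader_clusters(docs, threshold=0.5):
    """Single-pass, leader-style grouping (illustrative, not the paper's method).

    Each document joins the first cluster whose leader it resembles with
    cosine similarity >= threshold; otherwise it becomes a new leader.
    Returns a list of clusters, each a list of document indices.
    """
    leaders = []   # term vector of each cluster's leader
    clusters = []  # document indices per cluster
    for i, doc in enumerate(docs):
        vec = term_vector(doc)
        for k, leader in enumerate(leaders):
            if cosine_similarity(vec, leader) >= threshold:
                clusters[k].append(i)
                break
        else:
            leaders.append(vec)
            clusters.append([i])
    return clusters


docs = [
    "classical music concert",
    "classical music orchestra",
    "football match score",
    "football match result",
]
print(leader_clusters(docs))  # [[0, 1], [2, 3]]
```

Because the pass is order-dependent, shuffling the documents can change which ones become leaders; the threshold controls how coarse the resulting clusters are.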
Similar Resources
GWU NLP at SemEval-2016 Shared Task 1: Matrix Factorization for Crosslingual STS
We present a matrix factorization model for learning cross-lingual representations for sentences. Using sentence-aligned corpora, the proposed model learns distributed representations by factoring the given data into language-dependent factors and one shared factor. As a result, input sentences from both languages can be mapped into fixed-length vectors and then compared directly using the cosi...
Document Clustering Through Non-Negative Matrix Factorization: A Case Study of Hadoop for Computational Time Reduction of Large Scale Documents
In this paper we discuss a new model for document clustering that has been adapted using the non-negative matrix factorization method. The key idea is to cluster the documents after measuring the proximity of the documents to the extracted features. The extracted features are considered as the final cluster labels, and clustering is done using cosine similarity, which is equivalent to k-means with...
Comparison Clustering using Cosine and Fuzzy set based Similarity Measures of Text Documents
Keeping in consideration the high demand for clustering, this paper focuses on understanding and implementing K-means clustering using two different similarity measures. We have tried to cluster the documents using two different measures rather than clustering them with Euclidean distance. Also, a comparison is drawn based on the accuracy of clustering between the fuzzy and cosine similarity measures. The ...
Nonlocal Total Variation with Primal Dual Algorithm and Stable Simplex Clustering in Unsupervised Hyperspectral Imagery Analysis
We focus on implementing a nonlocal total variational method for unsupervised classification of hyperspectral imagery. We minimize the energy directly using a primal dual algorithm, which we modified for the non-local gradient and weighted centroid recalculation. By squaring the labeling function in the fidelity term before re-calculating the cluster centroids, we can implement an unsupervised ...
Text Document Clustering based on Phrase
Affinity propagation (AP) was recently introduced as an unsupervised learning algorithm for exemplar-based clustering. In this paper, a novel text document clustering algorithm has been developed based on the vector space model, phrases, and the affinity propagation clustering algorithm. The proposed algorithm can be called Phrase affinity clustering (PAC). PAC first finds the phrase by Ukkonen suffix tree con...